Trivia About Trivia

The Backstory

Throughout college, every Tuesday night I could, I went to a bar in downtown Gainesville with a group of friends to play trivia. One night we got to talking about what makes a good trivia team. We decided that certain topics always pop up in trivia, and that a good team has one or two people who know a lot about each of them. The topics we settled on were:

  • sports
  • offbeat, esoteric movies and TV
  • geography
  • history
  • current events

The Setup

As you can imagine, immediately after having this conversation, we began arguing about which player was the most important. So I began thinking...

"What topics come up in trivia the most?"

As a trivia nerd, I've spent more than my fair share of time watching Jeopardy. And when I found a dataset of all the Jeopardy questions and answers used in the show's 15-year history, I knew what I had to do.

Exploring the Data

First things first, we need to bring the CSV in and clean it up a bit: standardize the column names, strip out the punctuation, and drop the pieces that aren't relevant to our project.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
Out[2]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams
In [3]:
# tidy up the column names (the raw csv has stray leading spaces)
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
                    'Question', 'Answer']
jeopardy.columns
Out[3]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Now that the column names are tidied up, I'm going to remove some of the punctuation to make things easier down the road.

In [4]:
import re

def removePunct(word):
    # strip everything but letters, digits, and whitespace, then lowercase
    word = re.sub(r'[^A-Za-z0-9\s]', '', word)
    return word.lower()
In [5]:
jeopardy['clean_question'] = jeopardy['Question'].apply(removePunct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(removePunct)
In [7]:
jeopardy.head(5)
Out[7]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams
In [8]:
# keep just the two cleaned columns; .copy() so adding columns later
# doesn't set values on a view of jeopardy
new = jeopardy[['clean_question', 'clean_answer']].copy()
In [9]:
import nltk
# grab just the resources we need (a bare nltk.download() opens an interactive picker)
nltk.download('stopwords')
nltk.download('punkt')
Out[9]:
True
In [10]:
new.head()
Out[10]:
clean_question clean_answer
0 for the last 8 years of his life galileo was u... copernicus
1 no 2 1912 olympian football star at carlisle i... jim thorpe
2 the city of yuma in this state has a record av... arizona
3 in 1963 live on the art linkletter show this c... mcdonalds
4 signer of the dec of indep framer of the const... john adams

Now that everything is a bit cleaner, let's get to some NLP work. First we'll tokenize the strings and remove stop words to help with efficiency. I played around with stemming and lemmatizing the data, but decided they hurt the end result more than they helped.
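For reference, the lemmatizing variant I tried looked roughly like this (a sketch using NLTK's WordNetLemmatizer; the lemmatizeWords helper name is mine, and it assumes the 'wordnet' corpus has been downloaded):

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# requires a one-time nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatizeWords(para):
    # collapse each token to its dictionary form, e.g. 'cities' -> 'city'
    words = word_tokenize(para)
    return ' '.join(lemmatizer.lemmatize(w) for w in words)

Collapsing plurals and inflections like this merged some phrases that were genuinely distinct, which is why it didn't make the final cut.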

In [11]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
In [12]:
def removeStop(para):
    # build the stopword set once; set membership checks are far faster
    # than re-reading the corpus list for every token
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(para)
    useful_words = [w for w in words if w not in stop_words]
    return ' '.join(useful_words)
In [13]:
new['final_question'] = new['clean_question'].apply(removeStop)
new['final_answer'] = new['clean_answer'].apply(removeStop)
In [14]:
new.head()
Out[14]:
clean_question clean_answer final_question final_answer
0 for the last 8 years of his life galileo was u... copernicus last 8 years life galileo house arrest espousi... copernicus
1 no 2 1912 olympian football star at carlisle i... jim thorpe 2 1912 olympian football star carlisle indian ... jim thorpe
2 the city of yuma in this state has a record av... arizona city yuma state record average 4055 hours suns... arizona
3 in 1963 live on the art linkletter show this c... mcdonalds 1963 live art linkletter show company served b... mcdonalds
4 signer of the dec of indep framer of the const... john adams signer dec indep framer constitution mass seco... john adams

Building the Named Entity Recognizer

With the text cleaned up, we can run spaCy over each question and collect the noun chunks it finds, giving us the recurring phrases we're after.

In [15]:
import spacy
from spacy import displacy
from collections import Counter
In [19]:
all_ners = []

# 'en' is the spaCy 2.x shortcut for the small English model
nlp = spacy.load('en')

# collect every noun chunk spaCy finds in the cleaned questions
for ex in new['final_question']:
    doc = nlp(ex)
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)
In [20]:
len(all_ners)
Out[20]:
41905
In [50]:
# tally the chunks and keep the 200 most frequent
c = Counter(all_ners)
c = c.most_common(200)

There's a bunch of clutter in there that got past the stopword filter the first time around. I'm going to drop the obvious offenders from the top of the list, then run the same process on the answers.

In [41]:
# drop the first 9 entries, which are generic filler rather than real topics
c = c[9:]
c = dict(c)
In [47]:
# repeat the noun-chunk extraction, this time over the answers
all_ners = []

for ex in new['final_answer']:
    doc = nlp(ex)
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)

c2 = Counter(all_ners)
c2 = c2.most_common(200)
c2 = dict(c2)

Visualizing the Results

In [43]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# word cloud of the most common noun chunks in the questions
word_cloud_dict = c
wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(word_cloud_dict)

plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
In [48]:
# same plot, this time for the answers
word_cloud_dict = c2
wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(word_cloud_dict)

plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()

Wrap Up

And that wraps up this analysis! There are plenty of other routes to go down with this dataset that could lead to some interesting findings. With the results of our NER, we know that in order to have the best shot at Jeopardy!, you should focus on:

  • US States
  • Countries (specifically Western Europe, English-speaking or East Asian)
  • Eurocentric names
  • Royal lineages

Of course, NERs are susceptible to problems like anything else. This analysis picks up the phrases that appear most often, not necessarily the topics that appear most often. We could use a knowledge base or another method of tracing larger topics to build a better list of subjects.
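One rough way to do that with the tools we already have: count spaCy's built-in entity labels instead of the raw phrases, so that, say, individual country names all land under one GPE (geopolitical entity) bucket. A minimal sketch, reusing the nlp model and new dataframe from above:

from collections import Counter

# tally entity labels (GPE, PERSON, NORP, ...) rather than phrases,
# grouping individual names under broader categories
label_counts = Counter()
for ex in new['final_question']:
    doc = nlp(ex)
    for ent in doc.ents:
        label_counts[ent.label_] += 1

label_counts.most_common(10)

A proper knowledge base (Wikidata, for example) could go a step further and map each phrase to a subject area rather than an entity type.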

Thanks for reading!

Let me know if you have any comments, questions or thoughts!

allison.kahn.12@gmail.com